
Iceberg 1.11 support for Spark 411, part (1/3): extract version-divergent scan APIs behind a shim#14881

Merged
res-life merged 1 commit into NVIDIA:main from res-life:iceberg-1.11/pr1-common-shim
Jun 5, 2026


@res-life res-life commented May 26, 2026

Collaborator

Stacked work for #14853 (1/3) — common-code preparation for adding iceberg-1-11-x.

Depends on

Description

Refactors iceberg/common so the SparkScan / SparkCopyOnWriteScan / SparkBatch / DataWriteResult APIs that diverge between Iceberg 1.10.x and 1.11.x are hidden behind a small interface, with per-Iceberg-version implementations in iceberg-1-6-x / iceberg-1-9-x / iceberg-1-10-x. No behavior change for the Iceberg versions this PR ships; sets the stage for the follow-up PR that adds iceberg-1-11-x.

Common changes:

  • GpuSparkCopyOnWriteScan → renamed to GpuSparkCopyOnWriteScanBase (abstract). The runtime-filter trait + filter method live in a per-version concrete subclass (1.6/1.9/1.10 mix in SupportsRuntimeFiltering with filter(Filter[]); 1.11 will mix in SupportsRuntimeV2Filtering with filter(Predicate[])).
  • GpuSparkScan: rewrite hasNestedType via Spark's readSchema() + Spark types so it no longer depends on the 1.10-only cpuScan.expectedSchema(). Dispatch SparkCopyOnWriteScan construction through the new ShimUtils.newCopyOnWriteScan factory.
  • GpuSparkBatchQueryScan.toString uses cpuScan.description() (available in both 1.10 and 1.11) instead of branch / expectedSchema / filterExpressions (1.11 removed branch() and renamed the other two to projection() and filters()).
  • GpuSparkBatchQueryScan.runtimeFilterExpressions reflective field-read tolerates both the 1.10 name (runtimeFilterExpressions) and the 1.11 name (runtimeFilters).
  • GpuSparkBatch: same tolerance for expectedSchema (1.10) vs projection (1.11).
  • GpuSparkWrite: type-annotate new Array[DataFile](0) so Scala 2.13 doesn't infer Array[Nothing] under 1.11's wildcarded DataWriteResult.dataFiles().
  • IcebergShimUtils / ShimUtils: add newCopyOnWriteScan(Scan, RapidsConf, Boolean): GpuScan factory. The parameter is Spark's public Scan because Iceberg's SparkCopyOnWriteScan is package-private — cross-package callers cannot reference it directly.

Per-Iceberg-version module changes (1.6 / 1.9 / 1.10, all identical for the V1 path):

  • New GpuSparkCopyOnWriteScan in org.apache.iceberg.spark.source (so it can reference the package-private SparkCopyOnWriteScan). Companion object exposes create(Scan, ...): GpuScan for cross-package callers.
  • ShimUtilsImpl.java implements newCopyOnWriteScan via GpuSparkCopyOnWriteScan.create.
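The factory dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `Scan`, `CowScan`, `GpuScan`, `ScanShim`, and `ScanShimV110` are hypothetical stand-ins for Spark's `Scan`, Iceberg's package-private `SparkCopyOnWriteScan`, the plugin's `GpuScan`, `IcebergShimUtils`, and a per-version `ShimUtilsImpl`. The `RapidsConf` argument is omitted and the boolean parameter name is an assumption.

```java
import java.util.Objects;

// Stand-in for Spark's public Scan interface (hypothetical, not the real type).
interface Scan {
    String description();
}

// Stand-in for Iceberg's package-private SparkCopyOnWriteScan.
final class CowScan implements Scan {
    @Override
    public String description() {
        return "cow-scan";
    }
}

// Stand-in for the GPU-side scan that the factory returns.
interface GpuScan {
    String describe();
}

// Version-agnostic shim: common code only ever sees this interface.
interface ScanShim {
    // The parameter is the public Scan type, so callers in other packages
    // never have to name the package-private concrete class.
    GpuScan newCopyOnWriteScan(Scan cpuScan, boolean queryUsesInputFile);
}

// One implementation per supported version; this one plays the role of a
// hypothetical "1.10.x" module.
final class ScanShimV110 implements ScanShim {
    @Override
    public GpuScan newCopyOnWriteScan(Scan cpuScan, boolean queryUsesInputFile) {
        Objects.requireNonNull(cpuScan, "cpuScan");
        // The downcast happens inside the code that is allowed to see the concrete type.
        if (!(cpuScan instanceof CowScan)) {
            throw new IllegalArgumentException(
                "Expected a copy-on-write scan but got " + cpuScan.getClass().getName());
        }
        CowScan cow = (CowScan) cpuScan;
        return () -> "gpu(" + cow.description() + ", inputFile=" + queryUsesInputFile + ")";
    }
}
```

Common code would hold only a `ScanShim` reference, so adding a new version means adding one more implementation rather than touching the caller.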

The two try/catch field-name fallbacks (in GpuSparkBatchQueryScan and GpuSparkBatch) are tactical and will be pushed behind proper per-version IcebergShimUtils methods in a later cleanup PR.
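For illustration, a field-name fallback of this kind could look like the sketch below. It is an assumption-laden example rather than the code in this PR: `FieldFallback`, `OldScan`, `NewScanParent`, and `NewScan` are hypothetical names, and only the two field names come from the description above.

```java
import java.lang.reflect.Field;

final class FieldFallback {
    private FieldFallback() {}

    // Tries each candidate name in order; for each name, walks up the class
    // hierarchy and returns the first declared field that matches.
    static Field findField(Class<?> type, String... candidateNames) throws NoSuchFieldException {
        for (String name : candidateNames) {
            for (Class<?> c = type; c != null; c = c.getSuperclass()) {
                try {
                    Field f = c.getDeclaredField(name);
                    f.setAccessible(true);
                    return f;
                } catch (NoSuchFieldException ignored) {
                    // Not declared here: try the superclass, then the next candidate name.
                }
            }
        }
        throw new NoSuchFieldException(
            "None of [" + String.join(", ", candidateNames) + "] found on " + type.getName());
    }

    // Reads the first matching field and casts it to the expected type.
    static <T> T readField(Object target, Class<T> fieldType, String... candidateNames) {
        try {
            return fieldType.cast(findField(target.getClass(), candidateNames).get(target));
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Hypothetical scan classes: one with the older field name, one that inherits
// the newer field name from a parent class.
class OldScan {
    private final String runtimeFilterExpressions = "old";
}

class NewScanParent {
    private final String runtimeFilters = "new";
}

class NewScan extends NewScanParent {}
```

Note that this sketch searches name-first: an earlier candidate found on a superclass wins over a later candidate declared on the subclass. Whether that ordering is the right one depends on the classes involved.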

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (3.5.x + 4.0.x iceberg integration tests in `integration_tests/src/main/python/iceberg/` — exercises the new dispatch path with no behavior change vs. before this PR.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@gerashegalov gerashegalov left a comment

Collaborator

This should ideally build on top of #14866

@res-life res-life requested a review from a team May 28, 2026 02:48
@res-life res-life marked this pull request as ready for review May 28, 2026 02:49
@res-life

Collaborator Author

build

@greptile-apps

greptile-apps Bot commented May 28, 2026

Contributor

Greptile Summary

This PR is a preparatory refactor for Iceberg 1.11.x support (no behavior change for 1.6.x/1.9.x/1.10.x). It hides API surface that diverges between Iceberg versions behind a small shim interface, and updates reflective field/method access in GpuSparkScanAccess to tolerate renamed members across versions.

  • GpuSparkCopyOnWriteScan is split into an abstract base (GpuSparkCopyOnWriteScanBase in iceberg/common) and per-version concrete subclasses (1.6.x / 1.9.x / 1.10.x) that mix in the appropriate SupportsRuntimeFiltering trait; the 1.11.x subclass mixing in SupportsRuntimeV2Filtering will follow in a later PR.
  • GpuSparkScanAccess gains varargs overloads of readField / findField and a new invokeMethod helper, so callers can list candidate field/method names in priority order and the helper resolves whichever name exists in the running Iceberg version.
  • IcebergShimUtils / ShimUtils grow a newCopyOnWriteScan factory; each per-version ShimUtilsImpl implements it by calling GpuSparkCopyOnWriteScan.create.
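A helper along the lines of invokeMethod might be sketched as below. This is a hedged example, not the implementation in GpuSparkScanAccess: the class name `MethodFallback`, the method name `invokeFirst`, and the three demo classes are hypothetical. Returning null when no candidate exists and rethrowing the cause of an `InvocationTargetException` are one reasonable set of choices, not necessarily the ones the PR makes.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

final class MethodFallback {
    private MethodFallback() {}

    // Invokes the first no-arg method whose name matches a candidate, searching
    // the class hierarchy for each name in turn. Returns null if none exists.
    static <T> T invokeFirst(Object target, Class<T> returnType, String... candidateNames) {
        for (String name : candidateNames) {
            for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
                Method m;
                try {
                    m = c.getDeclaredMethod(name);
                } catch (NoSuchMethodException e) {
                    continue; // not declared on this class; look further up
                }
                try {
                    m.setAccessible(true);
                    return returnType.cast(m.invoke(target));
                } catch (InvocationTargetException e) {
                    // Surface the failure raised by the target method itself,
                    // rather than the reflective wrapper around it.
                    Throwable cause = e.getCause();
                    if (cause instanceof RuntimeException) throw (RuntimeException) cause;
                    if (cause instanceof Error) throw (Error) cause;
                    throw new IllegalStateException(name + "() threw a checked exception", cause);
                } catch (IllegalAccessException e) {
                    throw new IllegalStateException("Cannot access " + name + "()", e);
                }
            }
        }
        return null;
    }
}

// Hypothetical classes standing in for the older and newer method names.
class LegacyScan {
    protected String expectedSchema() { return "schema-v1"; }
}

class RenamedScan {
    protected String projection() { return "schema-v2"; }
}

class BrokenScan {
    protected String projection() { throw new UnsupportedOperationException("boom"); }
}
```

With this shape, a caller lists the names in priority order, for example `invokeFirst(scan, String.class, "expectedSchema", "projection")`, and a failure inside the target method reaches the caller as its original exception type.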

Confidence Score: 5/5

Safe to merge — no behavior change for existing Iceberg versions; the refactor correctly preserves all existing functionality while laying groundwork for 1.11.x.

All three per-version GpuSparkCopyOnWriteScan implementations are structurally identical to the old common class, the new factory dispatch is a straightforward delegation, and the reflective fallback logic in GpuSparkScanAccess is clearly reasoned with both candidate names listed and well-commented. The two observations flagged are minor diagnostic ergonomics and a theoretical field-priority edge case that doesn't affect any current caller.

GpuSparkScanAccess.java — the invokeMethod exception-wrapping tweak and the field-search priority comment are worth a look before the 1.11.x follow-up lands.

Important Files Changed

| Filename | Overview |
| --- | --- |
| iceberg/common/src/main/java/org/apache/iceberg/spark/source/GpuSparkScanAccess.java | Major refactor of reflective access helpers: adds varargs field-name fallback in readField/findField and a new invokeMethod with similar fallback for cross-version method resolution; minor wrapping issue with InvocationTargetException |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScanBase.scala | Renamed from GpuSparkCopyOnWriteScan; made abstract by removing SupportsRuntimeFiltering, filter(), filterAttributes(), and withInputFile(); these now live in the per-version concrete subclasses |
| iceberg/common/src/main/scala/org/apache/iceberg/spark/source/GpuSparkScan.scala | Minimal change: dispatch SparkCopyOnWriteScan construction through the new ShimUtils.newCopyOnWriteScan factory instead of direct new GpuSparkCopyOnWriteScan |
| iceberg/iceberg-1-10-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | New concrete subclass for 1.10.x, extends GpuSparkCopyOnWriteScanBase and mixes in SupportsRuntimeFiltering; identical in structure to 1.6.x and 1.9.x versions as expected |
| iceberg/iceberg-1-6-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | New concrete subclass for 1.6.x, identical to 1.9.x and 1.10.x versions as expected for this V1 path |
| iceberg/iceberg-1-9-x/src/main/scala/org/apache/iceberg/spark/source/GpuSparkCopyOnWriteScan.scala | New concrete subclass for 1.9.x, identical to 1.6.x and 1.10.x versions as expected for this V1 path |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["GpuSparkScan.tryConvert(cpuScan)"] --> B{isCopyOnWriteScan?}
    B -- Yes --> C["ShimUtils.newCopyOnWriteScan()"]
    B -- No --> D{isBatchQueryScan?}
    D -- Yes --> E["new GpuSparkBatchQueryScan"]
    D -- No --> F["IllegalArgumentException"]

    C --> G["IcebergShimUtils.newCopyOnWriteScan() [interface]"]
    G --> H1["iceberg-1-6-x ShimUtilsImpl"]
    G --> H2["iceberg-1-9-x ShimUtilsImpl"]
    G --> H3["iceberg-1-10-x ShimUtilsImpl"]
    G --> H4["iceberg-1-11-x ShimUtilsImpl (future)"]

    H1 --> I1["GpuSparkCopyOnWriteScan 1.6.x\nextends GpuSparkCopyOnWriteScanBase\nwith SupportsRuntimeFiltering / filter(Filter[])"]
    H2 --> I2["GpuSparkCopyOnWriteScan 1.9.x\nextends GpuSparkCopyOnWriteScanBase\nwith SupportsRuntimeFiltering / filter(Filter[])"]
    H3 --> I3["GpuSparkCopyOnWriteScan 1.10.x\nextends GpuSparkCopyOnWriteScanBase\nwith SupportsRuntimeFiltering / filter(Filter[])"]
    H4 --> I4["GpuSparkCopyOnWriteScan 1.11.x\nextends GpuSparkCopyOnWriteScanBase\nwith SupportsRuntimeV2Filtering / filter(Predicate[])"]

    subgraph common["iceberg/common (version-agnostic)"]
        A
        B
        C
        D
        E
        F
        BASE["GpuSparkCopyOnWriteScanBase (abstract)\nestimateStatistics / equals / hashCode / toString"]
    end

    I1 & I2 & I3 & I4 --> BASE

Reviews (2): Last reviewed commit: "Iceberg: prepare scan/bridge layer for 1..."

@res-life res-life marked this pull request as draft May 28, 2026 02:59
@res-life res-life force-pushed the iceberg-1.11/pr1-common-shim branch 2 times, most recently from 4647fc3 to ac790ba Compare May 29, 2026 03:41
@res-life res-life changed the title from "Iceberg: extract version-divergent scan APIs behind a shim" to "Iceberg 1.11 support for Spark 411, part (1/3): extract version-divergent scan APIs behind a shim" May 30, 2026
This PR has no behavior change for the Iceberg versions currently shipped
(1.6.x / 1.9.x / 1.10.x). It makes two refactors that are required by the
upcoming iceberg-1-11-x module:

1) Make GpuSparkScanAccess version-tolerant via reflection.
   NVIDIA#14866 introduced GpuSparkScanAccess as a root-loadable bridge to
   Iceberg's package-private scan classes. Three of its methods called
   SparkScan.branch() / expectedSchema() / filterExpressions(), which
   Iceberg 1.11 removed/renamed (the latter two became projection() and
   filters() respectively). One field read used
   "runtimeFilterExpressions", which 1.11 renamed to "runtimeFilters" on
   the new SparkRuntimeFilterableScan parent class. Switching to
   reflection with priority-ordered candidate names lets the same
   common-code bridge compile and run against any Iceberg version.
   readField was extended to accept varargs field names; a new
   invokeMethod helper does the same for protected methods.

2) Per-Iceberg-version GpuSparkCopyOnWriteScan subclass.
   NVIDIA#14866's GpuSparkCopyOnWriteScan in common hardcodes
   SupportsRuntimeFiltering + filter(Filter[]). Iceberg 1.11 switches
   SparkCopyOnWriteScan to SupportsRuntimeV2Filtering + filter(Predicate[]),
   so the hardcode prevents 1.11 from honoring runtime filtering.
   GpuSparkCopyOnWriteScan is therefore renamed to the abstract
   GpuSparkCopyOnWriteScanBase. The runtime-filter trait and filter()
   method live on a per-Iceberg-version concrete subclass shipped from
   each iceberg-1-N-x module. 1.6 / 1.9 / 1.10 mix in
   SupportsRuntimeFiltering with filter(Filter[]) (zero behavior change).
   1.11 will mix in SupportsRuntimeV2Filtering + filter(Predicate[]) in a
   follow-up PR.

   A new ShimUtils.newCopyOnWriteScan(Scan, RapidsConf, boolean)
   factory routes GpuSparkScan.tryConvert through the per-version
   ShimUtilsImpl so common code does not need to know which subclass to
   construct.

Verified by building buildver=350 (iceberg-1-6-x), buildver=356
(iceberg-1-9-x + iceberg-1-10-x co-shipped), and the full reactor
through this commit.

Signed-off-by: Chong Gao <res_life@163.com>
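
The split described in item 2 of the commit message, an abstract base plus per-version subclasses that each mix in a different runtime-filter interface, can be sketched in Java as follows. Every type here is a hypothetical stand-in: `V1Filter` / `V2Predicate` for Spark's `Filter` / `Predicate`, `RuntimeV1Filtering` / `RuntimeV2Filtering` for `SupportsRuntimeFiltering` / `SupportsRuntimeV2Filtering`, and the scan classes for the Scala classes in the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for Spark's V1 Filter and V2 Predicate types (hypothetical).
final class V1Filter {
    final String sql;
    V1Filter(String sql) { this.sql = sql; }
}

final class V2Predicate {
    final String sql;
    V2Predicate(String sql) { this.sql = sql; }
}

// Stand-ins for the two runtime-filter interfaces; same method name,
// different parameter type.
interface RuntimeV1Filtering {
    void filter(V1Filter[] filters);
}

interface RuntimeV2Filtering {
    void filter(V2Predicate[] predicates);
}

// Version-agnostic base: shared state and behavior only, no filter interface.
abstract class CopyOnWriteScanBase {
    protected final List<String> applied = new ArrayList<>();

    List<String> appliedFilters() {
        return applied;
    }

    @Override
    public String toString() {
        return getClass().getSimpleName() + applied;
    }
}

// Would ship from the modules whose upstream scan uses the V1 interface.
final class CopyOnWriteScanV1 extends CopyOnWriteScanBase implements RuntimeV1Filtering {
    @Override
    public void filter(V1Filter[] filters) {
        for (V1Filter f : filters) applied.add(f.sql);
    }
}

// Would ship from the module whose upstream scan uses the V2 interface.
final class CopyOnWriteScanV2 extends CopyOnWriteScanBase implements RuntimeV2Filtering {
    @Override
    public void filter(V2Predicate[] predicates) {
        for (V2Predicate p : predicates) applied.add(p.sql);
    }
}
```

Because the base class implements neither interface, the engine only sees the filtering capability that the concrete subclass for the running version declares.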
@res-life res-life force-pushed the iceberg-1.11/pr1-common-shim branch from ac790ba to a30d588 Compare June 3, 2026 02:16
@res-life

res-life commented Jun 4, 2026

Collaborator Author

build

@res-life res-life marked this pull request as ready for review June 4, 2026 03:37
* Iceberg 1.10.x copy-on-write scan: {@code SupportsRuntimeFiltering} with
* {@code filter(Array[Filter])}.
*/
class GpuSparkCopyOnWriteScan(

Collaborator

Non-blocking concision: the 1.6/1.9/1.10 implementations are identical apart from the version in the Scaladoc, and this class only depends on public Scan + SupportsRuntimeFiltering. Could the V1 path live once in common, e.g. GpuSparkCopyOnWriteV1Scan, with all three ShimUtilsImpls instantiating it? Then only the future 1.11 V2 path needs a version-specific class.

- return sparkScan(scan).branch();
+ // Iceberg 1.10.x and earlier: protected method SparkScan.branch(). Iceberg 1.11.x
+ // removed it entirely; return null for display purposes.
+ return invokeMethod(sparkScan(scan), String.class, "branch");

Collaborator

Non-blocking: for the 1.11 follow-up, this will render branch=null for branch reads because Iceberg 1.11 removed SparkScan.branch() but SparkBatchQueryScan/SparkCopyOnWriteScan still carry a private branch field and include it in description(). Should this accessor fall back to reading the branch field, or should the GPU scan toString use cpuScan.description() as the PR description says?
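
One way the first option in this comment (fall back to the branch field when the accessor is gone) could look is sketched below. It is only an illustration of the idea: `BranchAccess` and the three demo classes are hypothetical, and whether the real scan classes keep a field with this exact name in a given Iceberg version is taken from the comment above, not verified here.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;

final class BranchAccess {
    private BranchAccess() {}

    // Prefers a no-arg branch() accessor anywhere in the hierarchy, then a field
    // named "branch"; returns null only if neither exists.
    static String branchOf(Object scan) {
        for (Class<?> c = scan.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Method m = c.getDeclaredMethod("branch");
                m.setAccessible(true);
                return (String) m.invoke(scan);
            } catch (NoSuchMethodException e) {
                // Not declared on this class; keep walking up.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("branch() lookup failed", e);
            }
        }
        for (Class<?> c = scan.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Field f = c.getDeclaredField("branch");
                f.setAccessible(true);
                return (String) f.get(scan);
            } catch (NoSuchFieldException e) {
                // Not declared on this class; keep walking up.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("branch field lookup failed", e);
            }
        }
        return null;
    }
}

// Hypothetical scans: accessor plus field, field only, and neither.
class ScanWithAccessor {
    private final String branch = "audit";
    protected String branch() { return branch; }
}

class ScanWithFieldOnly {
    private final String branch = "audit";
}

class ScanWithoutBranch {}
```

The alternative raised in the comment, building toString from cpuScan.description(), avoids reflection on the field altogether.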

@gerashegalov gerashegalov left a comment

Collaborator

LGTM

@res-life res-life merged commit 0b46c1c into NVIDIA:main Jun 5, 2026
123 of 131 checks passed
3 participants